We study the online dynamics of learning in fully connected soft committee
machines in the student-teacher scenario. The locally optimal modulation
function, which determines the learning algorithm, is obtained from a variational
argument in such a manner as to maximise the average generalisation error
decay per example. Simulation results for the resulting algorithm are presented
for a few cases. The symmetric phase plateaux are found to be vastly reduced in
comparison to those found when online backpropagation algorithms are used. A
discussion of the implementation of these ideas as practical algorithms is given.

Learning how learning occurs in artificial systems has caught the attention of the Statistical Mechanics
community in the last decade. This interest was sparked by several developments, ranging from the invention of
efficient learning-from-examples methods such as backpropagation, which permit learning in computationally
complex machines, to the realisation that ideas from disordered systems, in particular spin glasses, could be
applied to the study of attractor as well as feedforward neural networks, and to the general interest in
complex systems with rugged energy landscapes.

The main results from the Statistical Mechanics approach (see e.g. [?]) have almost invariably been
obtained in the thermodynamic limit and have benefited from the powerful techniques used to calculate the
averages over the disorder introduced by the random nature of the examples.

Among several possible approaches to machine learning, online learning [?] has been the subject of an
intense research effort due to several factors. In this scheme, examples are used only once, thereby avoiding
the need for expensive memory resources, typical of offline methods. This, however, does not necessarily
translate into poor performance, since efficient methods can be devised whose performance is comparable to
that of the memory-based ones. Furthermore, learning sequentially from single examples has a greater biological flavor
than offline processing. While efficiency, computational economy and biological relevance may be the most
relevant factors, the theoretical possibility of rather complete analytical studies has also played an important
role. If each one of these factors is, by itself, sufficiently important to make online learning an attractive
scheme, together they combine to give a most compelling argument for its thorough study.

In this letter we present results of the optimisation of online supervised learning in a model consisting of 
a fully connected multilayer feedforward neural network, in what has become known as the student-teacher 
scenario. The type of result we present here brings together two separate lines of research that have been 
recently pursued by several groups. 

The study of online backpropagation, as put forward by Biehl and Schwarze [?] and later developed in [?],
has permitted the analytical understanding of several properties of the dynamics of the learning process.
The most striking feature is the existence of learning plateaux or symmetric phases, which signal learning
stages where the information available to the student and the form in which it is used do not permit breaking
the permutation symmetry among the hidden nodes. Further learning eventually permits the escape from
the neighbourhood of these repulsive symmetric fixed points into the broken symmetry, specialized phase.
The onset of specialization and different methods to hasten it have been dealt with by several authors [?].

The second line of research from which we draw is the variational study of locally optimal online learning. 
This program deals with the determination of lower bounds for the generalisation errors in different models 
in controlled learning scenarios. The constructive nature of the variational approach has permitted finding 
update rules that lead to student networks with the optimal generalisation performance. The relation of this 
approach to Bayesian methods has been discussed in [?] and in [?].

The variational method has been previously applied to machines with no internal units [14]-[16] or with
hidden units but nonoverlapping receptive fields (RF) [17], [18], and also in the case of unsupervised learning
[?]. We will introduce the variational method for feedforward machines with overlapping RF. The differences
stem from the fact that while in the former case the generalization error is a monotonically decreasing function of
the order parameters (student-teacher overlaps), in the latter the monotonicity is lost, due to the appearance
of crossed overlaps.

The main results presented here concern the analysis of the locally optimized online learning dynamics of
a soft committee machine. We present results for over-realisable and realisable cases. The striking reduction or
complete elimination of the plateaux in the learning curves attests to the great improvement achievable by
concentrating on extracting the largest possible amount of information from each example. Rapid escape
from the plateaux can be attributed to a fluctuation enhancing mechanism that stimulates permutational
symmetry breaking.

The aim of learning is to obtain a set of student weights $J_{ik}$, where $i (= 1, \ldots, N)$ indexes input layer units
and $k (= 1, \ldots, K)$ hidden nodes, in such a manner that the student implements as closely as possible the map
represented by the teacher network defined by a set of weights $B_{in}$, where $i (= 1, \ldots, N)$ labels the input
layer unit and $n (= 1, \ldots, M)$ the hidden node. We use $n, m, \ldots$ to label teacher branches and $j, k, \ldots$ for the
student branches. Call $\mathbf{B}_n = (B_{1n}, B_{2n}, \ldots, B_{Nn})$ and $\mathbf{J}_k = (J_{1k}, J_{2k}, \ldots, J_{Nk})$ the weight branch vectors and $B_n$
and $J_k$ their respective lengths. We define as usual the order parameters $R_{kn} = \mathbf{J}_k \cdot \mathbf{B}_n$, $Q_{ij} = \mathbf{J}_i \cdot \mathbf{J}_j$ and
$M_{nm} = \mathbf{B}_n \cdot \mathbf{B}_m$, which will be taken as $M_{nm} = \delta_{nm}$ for simplicity.

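At each time step $\mu$, an example $\mathbf{S}^\mu$ is drawn from a known distribution $P(\mathbf{S})$. We call $\Sigma_B$ and $\Sigma_J$ the
teacher and student outputs respectively. The internal fields are denoted by $y_n^\mu = \mathbf{B}_n \cdot \mathbf{S}^\mu$ and $x_k^\mu = \mathbf{J}_k \cdot \mathbf{S}^\mu$.
The available information is used in updating the student weights $J_{ik}$,

$$J_{ik}(\mu + 1) = J_{ik}(\mu) + \frac{F_k}{N} S_i^\mu \qquad (1)$$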



This is not the most general update possible, since a decay term, useful in controlling the length of $\mathbf{J}_k$, can be
used; we will however not pursue this direction here. The central quantity in this theoretical approach is the
set of modulation functions $F = (F_1, F_2, \ldots, F_K)$. The following analysis will be done in the thermodynamic
limit. For any transfer function, the evolution of the order parameters is given by a set of $(K^2 + K)/2 + KM$
first order differential equations. For fully connected architectures we have:

$$\frac{dR_{kn}}{d\alpha} = \langle y_n F_k \rangle, \qquad \frac{dQ_{ij}}{d\alpha} = \langle x_i F_j + x_j F_i + F_i F_j \rangle \qquad (2)$$



where, as usual, $\alpha = \mu/N$ measures the learning time. We now proceed, first to obtain the best $F$ from a
generalisation point of view, and then to analyse the dynamical consequences that such a choice will have.
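
The update rule and the role of the modulation function can be illustrated with a small simulation. The sketch below is not the optimized algorithm of this letter: it assumes the conventional gradient-descent choice $F_k = \eta (\Sigma_B - \Sigma_J) g'(x_k)$ as a stand-in modulation function, erf hidden units with a linear output, and inputs with independent unit-variance Gaussian components. All function names are hypothetical.

```python
import math
import random


def g(h):
    """Hidden-unit transfer function erf(h / sqrt(2))."""
    return math.erf(h / math.sqrt(2.0))


def g_prime(h):
    """Derivative of g: sqrt(2 / pi) * exp(-h**2 / 2)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * h * h)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fields_and_output(W, S):
    """Internal fields W_k . S of each branch and the summed erf output."""
    h = [dot(Wk, S) for Wk in W]
    return h, sum(g(hk) for hk in h)


def backprop_modulation(x, sigma_B, eta=1.0):
    """Assumed baseline: F_k = eta * (Sigma_B - Sigma_J) * g'(x_k)."""
    sigma_J = sum(g(xk) for xk in x)
    return [eta * (sigma_B - sigma_J) * g_prime(xk) for xk in x]


def online_step(J, B, S, modulation=backprop_modulation):
    """Apply J_k <- J_k + F_k * S / N once; return (new J, F)."""
    N = len(S)
    x, _ = fields_and_output(J, S)        # student fields x_k
    _, sigma_B = fields_and_output(B, S)  # teacher output Sigma_B
    F = modulation(x, sigma_B)
    J_new = [[Jik + Fk * Si / N for Jik, Si in zip(Jk, S)]
             for Jk, Fk in zip(J, F)]
    return J_new, F


def order_parameters(J, B):
    """Overlaps R_kn = J_k . B_n and Q_ij = J_i . J_j."""
    R = [[dot(Jk, Bn) for Bn in B] for Jk in J]
    Q = [[dot(Ji, Jj) for Jj in J] for Ji in J]
    return R, Q


def random_unit_vector(N, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]


if __name__ == "__main__":
    rng = random.Random(0)
    N, K, M = 100, 2, 2
    # Teacher branches of unit length; they are only approximately orthogonal.
    B = [random_unit_vector(N, rng) for _ in range(M)]
    J = [[0.1 * c for c in random_unit_vector(N, rng)] for _ in range(K)]
    for mu in range(20 * N):  # alpha = mu / N runs up to 20
        S = [rng.gauss(0.0, 1.0) for _ in range(N)]
        J, _ = online_step(J, B, S)
    R, Q = order_parameters(J, B)
    print("R =", R)
    print("Q =", Q)
```

For this update the single-step changes are exactly $\Delta R_{kn} = F_k y_n / N$ and $\Delta Q_{ij} = (x_i F_j + x_j F_i)/N + F_i F_j \, \mathbf{S} \cdot \mathbf{S} / N^2$, which is where the averages in the order-parameter equations come from once $\mathbf{S} \cdot \mathbf{S} \approx N$.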

A point of technical importance, which in no way restricts the validity of the general properties of the
results discussed here, concerns the choice of an error function for the sigmoidal transfer function $g$ of the
internal units and a linear transfer function for the output unit, following [?], since it permits better analytical
tractability. Thus $\Sigma_B = \sum_{n=1,\ldots,M} \mathrm{erf}(y_n/\sqrt{2})$ and $\Sigma_J = \sum_{k=1,\ldots,K} \mathrm{erf}(x_k/\sqrt{2})$.

For a fixed teacher, the student network will have a generalisation error $e_g(\{\mathbf{J}_k\}) = \langle \frac{1}{2} (\Sigma_B - \Sigma_J)^2 \rangle_S$. In the
thermodynamic limit, for a uniform distribution of examples, the generalisation error can be written as a
function of the order parameters:
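
For erf hidden units and jointly Gaussian internal fields, such a function follows from the identity $\langle \mathrm{erf}(u/\sqrt{2})\,\mathrm{erf}(v/\sqrt{2}) \rangle = \frac{2}{\pi} \arcsin\!\big( C_{uv} / \sqrt{(1 + C_{uu})(1 + C_{vv})} \big)$. The sketch below assumes that this standard arcsine closed form is the expression intended here; the function name and argument layout are hypothetical.

```python
import math


def generalisation_error(Q, R, M):
    """e_g = (1/pi) * [student-student + teacher-teacher - 2 * student-teacher].

    Q[i][k] = J_i . J_k, R[k][n] = J_k . B_n, M[n][m] = B_n . B_m.
    Each bracketed term is a sum of arcsines of normalised overlaps.
    """
    def term(c, a, b):
        return math.asin(c / math.sqrt((1.0 + a) * (1.0 + b)))

    K, T = len(Q), len(M)
    student = sum(term(Q[i][k], Q[i][i], Q[k][k])
                  for i in range(K) for k in range(K))
    teacher = sum(term(M[n][m], M[n][n], M[m][m])
                  for n in range(T) for m in range(T))
    cross = sum(term(R[k][n], Q[k][k], M[n][n])
                for k in range(K) for n in range(T))
    return (student + teacher - 2.0 * cross) / math.pi


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # A student identical to a K = M = 2 teacher has zero generalisation error.
    print(generalisation_error(identity, identity, identity))
    # A student with vanishing weights facing a single-unit teacher.
    print(generalisation_error([[0.0]], [[0.0]], [[1.0]]))
```

Two limits are easy to check by hand: a student equal to the teacher gives $e_g = 0$, and a student with vanishing weights facing an $M = 1$ teacher gives $e_g = \arcsin(1/2)/\pi = 1/6$.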
Local optimisation is obtained by maximizing the average generalisation error decay for each example in a
given state $(R_{kn}, Q_{ij})$. We thus look, following [14], at the extrema of the functional $\dot{e}_g[F] = d e_g[F]/d\alpha$,
that is, the modulation function $F$ which satisfies $\delta \dot{e}_g / \delta F_k = 0$. The solution has the general form

$$\mathbf{F} = H^{-1} G \langle \mathbf{y} \rangle_{\mathcal{H}|\mathcal{V}} - \mathbf{x} \qquad (4)$$

where $H$, the functional Hessian matrix, and $G$ are defined as $H_{ij} = \delta^2 \dot{e}_g / \delta F_i \delta F_j$ and $G_{kn} = -\partial e_g / \partial R_{kn}$, and
the conditional expectation is taken with respect to the assumed examples' probability distribution $P(\mathbf{S})$.
The symbols $\mathcal{H}$ and $\mathcal{V}$ stand for the set of Hidden or Visible information. It is interesting to note that (4)
holds for any choice of transfer function or examples' distribution. For the particular case of examples drawn
independently from a uniform spherical distribution, we have to solve integrals of the form:

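$$\int \prod_{n} dy_n \, P_C(\mathbf{x}, \mathbf{y}) \, y_m^{\epsilon} \, \delta\!\Big( \Sigma_B - \sum_{n} \mathrm{erf}(y_n/\sqrt{2}) \Big) \qquad (5)$$

where $\epsilon = 0, 1$ and $P_C(\mathbf{x}, \mathbf{y})$ is a $(K + M)$ multivariate gaussian with correlation matrix

$$C = \begin{pmatrix} Q & R \\ R^{T} & M \end{pmatrix}$$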
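
Averages over the internal fields can be estimated without simulating $N$-dimensional inputs by sampling $(\mathbf{x}, \mathbf{y})$ directly from their joint gaussian. The sketch below assumes inputs with zero-mean, unit-variance components, so that the covariance blocks are $\langle x_i x_j \rangle = Q_{ij}$, $\langle x_k y_n \rangle = R_{kn}$ and $\langle y_n y_m \rangle = M_{nm}$, and it estimates $e_g = \langle \frac{1}{2}(\Sigma_B - \Sigma_J)^2 \rangle$ by Monte Carlo. The helper names are hypothetical, and the covariance matrix must be positive definite for the Cholesky step to work.

```python
import math
import random


def g(h):
    """Hidden-unit transfer function erf(h / sqrt(2))."""
    return math.erf(h / math.sqrt(2.0))


def correlation_matrix(Q, R, M):
    """Block matrix [[Q, R], [R^T, M]] for the fields (x_1..x_K, y_1..y_M)."""
    K, T = len(Q), len(M)
    top = [list(Q[i]) + list(R[i]) for i in range(K)]
    bottom = [[R[k][n] for k in range(K)] + list(M[n]) for n in range(T)]
    return top + bottom


def cholesky(C):
    """Lower-triangular L with L L^T = C (C must be positive definite)."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def sample_fields(L, K, rng):
    """Draw one sample (x, y) with covariance L L^T."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(L))]
    h = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(L))]
    return h[:K], h[K:]


def monte_carlo_eg(Q, R, M, samples=20000, seed=0):
    """Monte Carlo estimate of <(Sigma_B - Sigma_J)^2 / 2> over the fields."""
    rng = random.Random(seed)
    K = len(Q)
    L = cholesky(correlation_matrix(Q, R, M))
    total = 0.0
    for _ in range(samples):
        x, y = sample_fields(L, K, rng)
        diff = sum(g(yn) for yn in y) - sum(g(xk) for xk in x)
        total += 0.5 * diff * diff
    return total / samples


if __name__ == "__main__":
    print(monte_carlo_eg([[1.0]], [[0.5]], [[1.0]]))
```

The same sampler could in principle be reused for other field averages, such as those appearing in the order-parameter equations, by replacing the quantity accumulated in the loop.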



We now present results obtained by simulating an $N = 5000$ system for the cases $K = 2$, $M = 1$ and
$K = M = 2$. Further details will be presented elsewhere [20]. In figures 1 and 2 we show the learning curves
for these two cases. Backpropagation results are included for comparison. Figure 3 shows the evolution of
$Q_{ik}$ for the $K = M = 2$ case and suggests that the mechanism used to enhance fluctuations and break the
permutation symmetry is to increase synaptic vector norms and stimulate anti-correlated weights.

Whether or not this solution of the variational problem leads to a maximum generalisation will be governed
by the functional Hessian matrix $H$. Note that the dependence of the dynamics on the modulation function
is only second order; therefore $H$ is a function of the order parameters and not explicitly of the particular
algorithm that led to that state of affairs. A negative eigenvalue of $H$ at a given point in the space of order
parameters implies that at that point an optimal algorithm cannot be analytically found.
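
One way to see why $H$ depends only on the order parameters: assuming the evolution equations (2), the only part of $d e_g / d\alpha$ that is quadratic in $F$ is $\sum_{i \le j} (\partial e_g / \partial Q_{ij}) \langle F_i F_j \rangle$, which suggests $H_{ii} = 2\, \partial e_g / \partial Q_{ii}$ and $H_{ij} = \partial e_g / \partial Q_{ij}$ for $i \neq j$. The sketch below evaluates this numerically for $K = 2$, assuming the standard arcsine closed form of $e_g$ for erf units. The function names, the finite-difference step and this normalisation of $H$ are assumptions made for illustration, not taken from the letter.

```python
import math


def generalisation_error(Q, R, M):
    """Arcsine closed form of e_g for erf units (assumed standard result)."""
    def term(c, a, b):
        return math.asin(c / math.sqrt((1.0 + a) * (1.0 + b)))

    K, T = len(Q), len(M)
    student = sum(term(Q[i][k], Q[i][i], Q[k][k])
                  for i in range(K) for k in range(K))
    teacher = sum(term(M[n][m], M[n][n], M[m][m])
                  for n in range(T) for m in range(T))
    cross = sum(term(R[k][n], Q[k][k], M[n][n])
                for k in range(K) for n in range(T))
    return (student + teacher - 2.0 * cross) / math.pi


def d_eg_dQ(Q, R, M, i, j, h=1e-5):
    """Central difference of e_g with respect to the single variable Q_ij = Q_ji."""
    def shifted(delta):
        Qs = [row[:] for row in Q]
        Qs[i][j] += delta
        if i != j:
            Qs[j][i] += delta  # keep Q symmetric
        return generalisation_error(Qs, R, M)

    return (shifted(h) - shifted(-h)) / (2.0 * h)


def hessian(Q, R, M):
    """H_ii = 2 de_g/dQ_ii and H_ij = de_g/dQ_ij (i != j), as assumed above."""
    K = len(Q)
    return [[(2.0 if i == j else 1.0) * d_eg_dQ(Q, R, M, i, j)
             for j in range(K)] for i in range(K)]


def eigenvalues_2x2(H):
    """Eigenvalues (smallest first) of a symmetric 2x2 matrix."""
    mean = 0.5 * (H[0][0] + H[1][1])
    radius = math.hypot(0.5 * (H[0][0] - H[1][1]), H[0][1])
    return mean - radius, mean + radius


if __name__ == "__main__":
    # K = 2, M = 1, unspecialised start: Q11 = 0.5, Q22 = 1e-6, Q12 = 0, R = 0.
    start = hessian([[0.5, 0.0], [0.0, 1e-6]], [[0.0], [0.0]], [[1.0]])
    print(eigenvalues_2x2(start))
```

Under these assumptions, an unspecialised initial state of the $K = 2$, $M = 1$ student yields one negative eigenvalue, while a perfectly specialised $K = M = 2$ student yields two positive ones.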

The evolution of the eigenvalues for both cases is shown as insets in figures 1 and 2. In the space of
algorithms, for both cases, at the beginning of the learning process these modulation functions represent
saddle points rather than maxima. For the case $K = 2$, $M = 1$ this can be explained as follows. The best
generalisation would be obtained by using a correct architecture, $K = M = 1$; thus the optimal strategy is to
trim the student into the correct architecture and then proceed with the optimized nonlinear perceptron
algorithm, which could then be obtained by the above variational method. This kind
of modulation function cannot be obtained analytically by searching for zero derivatives in the space of
algorithms of the $K = 2$ student. The solution found by our method does cut out one of the branches
around $\alpha \sim$ [?] and turns itself into an effectively $K = 1$ machine quite rapidly, avoiding the long plateau of
the backpropagation algorithm.

The explanation for the initially negative eigenvalue of $H$ in the $K = M = 2$ case is not different. The
optimal strategy is within the space of students with a $K = 2$ architecture and asymmetric initial conditions,
and thus it will not be found by the variational approach. Before there is any information to hint that the
permutation symmetry should be broken, it is more efficient locally to learn with a $K = 1$ machine (with an
output multiplied by 2). This, however, does not remain true for long, since such a machine would never escape the plateau.
Since the escape is achieved by amplification of symmetry breaking fluctuations, learning initially with a
nonlinear perceptron cannot be globally efficient, for it totally suppresses the desired effect of fluctuations.

A null eigenvalue of $H$ indicates the existence of a class of algorithms with performance identical to that
of eq. (4); this can be interpreted as a kind of functional robustness. An example of this appears for the case
$K = M = 2$, where the smallest eigenvalue stays very close to zero in the plateau state (see fig. 2). The
significance of this is that, due to functional robustness, the exact determination of the modulation function
is not very critical for learning and eventually escaping the plateau.

Although our method permits the locally optimal extraction of information from an example, it does not
assure that the system will follow the best global trajectory in the space of order parameters. The global
functional optimisation has been recently addressed in [21], where the equivalence between local
and global optimisations was shown for the boolean perceptron, together with the better performance of the
global approach in the $K = 3$, $M = 1$ case. A thorough investigation of how global and local optimisations
are related is an important issue and remains to be done.

The effects of finite size $N$ have not been systematically investigated, and therefore the advantages of these
methods, if any, over conventional algorithms remain to be proved. Nevertheless, learning is easier in smaller
networks, and a straightforward use of the modulation function in regimes where the central limit theorem
cannot yet be used leads to a successful learning prescription, as can be seen from simulating learning for the
rather small network with $N = 15$, $K = M = 2$ [20].

The main difficulties of using this approach to construct practical algorithms concern the assumed knowledge
of several unavailable quantities. First of all, the examples' probability distribution is needed in order to
calculate the integrals in equation (5). Then, the resulting modulation function depends on unknown order
parameters, such as $R_{im}$, and, worse, these order parameters are only self-averaging in the thermodynamic
limit. We first briefly discuss the first two points. Optimality is hard to define: several different possible
criteria lead to different results. Also, given a definition, such as the one we use here of maximizing
generalization, the optimal prescription will depend on the amount of available information and on the environment
where learning takes place. Although we do not attempt to solve these problems here, a short digression
is in order. A parametric representation $P(\mathbf{S}, \Sigma_B) \approx P_w(\mathbf{S}, \Sigma_B)$ permits introducing an extra set of $p$
differential equations for the online estimation of the distribution parameters $w = (w_1, w_2, \ldots, w_p)$. The
order parameters can also be estimated online in an analogous way, as has been done in [22], even in the case of time
dependent or drifting rules.

How robust these "optimal" algorithms are in the absence or misestimation of this information, as well as
their response to learning in noisy environments, remains to be seen. The last issue has been addressed recently
in [23] for boolean machines, where a large robustness to noise-level misestimation was found, as well as efficient
online noise level estimators which manage to steer the dynamics into an efficient learning phase.

These comments about the need for extra knowledge to implement these methods as algorithms can be 
seen as drawbacks for the variational program. We rather think of them as calling our attention to the 
further work that has to be done in order to obtain efficient adaptive practical algorithms, and pointing out 
directions in which these objectives can be reached. Whatever point of view is chosen, the validity of these 
results and their relation to improving the generalisation ability remain.

FIG. 1. Generalization error learning curves for the $K = 2$, $M = 1$ case obtained by simulating a system of
$N = 5000$ with random initial conditions $Q_{11} \in [0, 0.5]$, $Q_{22} \in [0, 10^{-6}]$ and $Q_{12} \approx 0$. Black circles: optimized
algorithm; white circles: conventional backpropagation with learning rate $\eta = 1.5$. Inset: eigenvalues of the Hessian
$H$. There is a transient where the smallest eigenvalue is negative; it then crosses rapidly into positive values.

FIG. 2. Same as fig. 1 but for the $K = M = 2$ case. Inset: eigenvalues of the Hessian $H$. Note that the smallest
eigenvalue stays very close to zero in the plateau.

FIG. 3. Evolution of the overlaps. Note the anticorrelation that builds up during the transient. Inset: details of
the escape from the plateau.